GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays
Modern-day analytics deals with big datasets from diverse fields. For many applications the data takes the form of a large array composed of a large number of smaller arrays. Existing techniques focus on sorting a single large array and cannot efficiently sort a large number of smaller arrays. Currently no algorithm is available that can sort such collections of arrays while utilizing the massively parallel architecture of GPU devices. In this paper we present a highly scalable parallel algorithm, called GPU-ArraySort, for sorting large numbers of arrays on a GPU. Our algorithm operates in-place and makes minimal use of temporary run-time memory. Our results indicate that we can sort up to 2 million arrays of 1000 elements each within a few seconds. We compare our results with the unorthodox tagged array sorting technique based on NVIDIA's Thrust library; GPU-ArraySort outperforms the tagged array sorting technique by sorting three times more data in a much smaller time. The developed tool and strategy will be made available at https://github.com/pcdslab
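The tagged-array baseline that the abstract compares against can be sketched in a few lines of pure Python (a CPU analogue of the Thrust-based technique, not the GPU-ArraySort implementation itself): each element is paired with the id of its parent array, and one global sort over (tag, value) pairs leaves every sub-array grouped together and internally sorted.

```python
def tagged_sort(arrays):
    # Pair every element with the id ("tag") of its parent array.
    pairs = [(tag, v) for tag, arr in enumerate(arrays) for v in arr]
    # One global sort: primary key = tag, secondary key = value,
    # so each sub-array ends up contiguous and internally sorted.
    pairs.sort()
    out = [[] for _ in arrays]
    for tag, v in pairs:
        out[tag].append(v)
    return out
```

Note the temporary (tag, value) buffer this approach requires; the in-place operation claimed for GPU-ArraySort is precisely what such a baseline lacks.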
MS-REDUCE: An ultrafast technique for reduction of Big Mass Spectrometry Data for high-throughput processing
Modern proteomics studies utilize high-throughput mass spectrometers which can produce data at an astonishing rate. These big Mass Spectrometry (MS) datasets can easily reach the peta-scale, creating storage and analytic problems for large-scale systems biology studies. Each spectrum consists of thousands of peaks which have to be processed to deduce the peptide. However, only a small percentage of peaks in a spectrum are useful for peptide deduction; most are either noise or not useful for the given spectrum. This redundant processing of non-useful peaks is a bottleneck for streaming high-throughput processing of big MS data. One way to reduce the amount of computation required in a high-throughput environment is to eliminate non-useful peaks. Existing noise-removal algorithms are limited in their data-reduction capability and are compute-intensive, making them unsuitable for big-data and high-throughput environments. In this paper we introduce a novel low-complexity data-reductive strategy for analysis of big MS data, based on classification, quantization and sampling of MS peaks. Our algorithm, called MS-REDUCE, is capable of eliminating noisy peaks as well as peaks that do not contribute to peptide deduction, before any peptide deduction is attempted. Our experiments have shown up to 100x speedup over existing state-of-the-art noise elimination algorithms while maintaining comparably high-quality matches. Using our approach we were able to process a million spectra in just under an hour on a moderate server. The developed tool and strategy will be made available to the wider proteomics and parallel computing community at the authors' webpages after the paper is published.
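The classify-quantize-sample idea can be illustrated with a toy sketch (an assumption about the general scheme, not the published MS-REDUCE implementation): bin peaks into intensity classes, then spend a fixed retention budget starting from the most intense classes, which are more likely to carry peptide-deduction signal.

```python
import random

def ms_reduce(peaks, keep_fraction=0.3, n_classes=4, seed=0):
    """Toy reduction of a spectrum given as a list of (mz, intensity)
    tuples: quantize intensities into classes, then sample a budget of
    peaks starting from the highest-intensity class."""
    rng = random.Random(seed)
    lo = min(i for _, i in peaks)
    hi = max(i for _, i in peaks)
    span = (hi - lo) or 1.0
    # Quantize each peak's intensity into one of n_classes bins.
    classes = [[] for _ in range(n_classes)]
    for mz, inten in peaks:
        c = min(int((inten - lo) / span * n_classes), n_classes - 1)
        classes[c].append((mz, inten))
    budget = max(1, int(len(peaks) * keep_fraction))
    kept = []
    # Walk classes from most to least intense, spending the budget.
    for cls in reversed(classes):
        take = min(len(cls), budget - len(kept))
        kept.extend(rng.sample(cls, take))
        if len(kept) >= budget:
            break
    return sorted(kept)
```

The point of the sketch is the complexity profile: one linear pass to classify plus a bounded sampling step, with no expensive per-peak modeling.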
An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics
Modern high-resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most sequential noise-reducing algorithms are impractical to use as a pre-processing step due to their high time-complexity. In this paper, we present a GPU-based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between CPU and GPU using a minimum amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based upon the size of the input data. G-MSR achieves a peak speedup of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open source on GitHub at the following link: https://github.com/pcdslab/G-MSR
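The general idea behind an indexed flat layout like QIS can be sketched as follows (the names and details here are assumptions for illustration, not the paper's exact format): quantize m/z values onto a grid and pack a ragged collection of spectra into one contiguous 1-D array plus an offsets table, so the whole batch can be shipped to a GPU as two flat buffers.

```python
def build_indexed_spectra(spectra, step=1.0):
    """Pack a ragged list of spectra (lists of m/z values) into a flat
    1-D array of quantized m/z indices plus a CSR-style offsets table."""
    flat, offsets = [], [0]
    for spectrum in spectra:
        for mz in spectrum:
            flat.append(int(round(mz / step)))  # quantized m/z index
        offsets.append(len(flat))               # end of this spectrum
    return flat, offsets

def get_spectrum(flat, offsets, i):
    """Recover spectrum i from the flat layout without copying the rest."""
    return flat[offsets[i]:offsets[i + 1]]
```

This CSR-like packing is a common way to keep variable-length records contiguous in GPU memory while preserving per-spectrum boundaries.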
GPU-PCC: A GPU Based Technique to Compute Pairwise Pearson’s Correlation Coefficients for Big fMRI Data
Functional Magnetic Resonance Imaging (fMRI) is a non-invasive brain imaging technique for studying the brain’s functional activities. Pearson’s Correlation Coefficient is an important measure for capturing dynamic behaviors and functional connectivity between brain components. One bottleneck in computing correlation coefficients is the time it takes to process big fMRI data. In this paper, we propose GPU-PCC, a GPU-based algorithm built on the vector dot product, which computes pairwise Pearson’s Correlation Coefficients while performing the computation only once for each pair. Our method computes the coefficients in an ordered fashion, without the need for post-processing reordering. We evaluated GPU-PCC using synthetic and real fMRI data and compared it with a sequential version of the correlation computation on a CPU and an existing state-of-the-art GPU method. We show that GPU-PCC runs 94.62× faster than the CPU version and 4.28× faster than the existing GPU-based technique on a real fMRI dataset of 90k voxels. The implemented code is available under the GPL license on our lab's GitHub portal at https://github.com/pcdslab/GPU-PCC
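The dot-product formulation is easy to see on the CPU (a minimal sketch of the mathematics, not the GPU-PCC kernel): z-normalize each time series once, after which every Pearson coefficient is just a dot product, computed once per pair and emitted already in (i, j) order.

```python
import math

def pairwise_pcc(series):
    """Pearson coefficients for all pairs of (non-constant) time series,
    emitted in (0,1), (0,2), ..., (1,2), ... order with no reordering."""
    normed = []
    for x in series:
        mean = sum(x) / len(x)
        centered = [v - mean for v in x]
        norm = math.sqrt(sum(v * v for v in centered))
        # After this normalization, dot(a, b) IS the Pearson coefficient.
        normed.append([v / norm for v in centered])
    out = []
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            out.append(sum(a * b for a, b in zip(normed[i], normed[j])))
    return out
```

Normalizing once up front is what lets each pair cost a single dot product, which is the operation GPUs execute at very high throughput.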
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
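The hashing motif shows up concretely in k-mer analysis: distributed k-mer counting amounts to hashing substrings into a shared table with asynchronous updates. A single-process toy version (a stand-in for the distributed hash tables used at scale, not any particular pipeline's code) looks like this:

```python
from collections import Counter

def kmer_histogram(reads, k=3):
    """Count every length-k substring across a collection of reads.
    The Counter stands in for the shared hash table that a distributed
    implementation would update asynchronously across nodes."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    return table
```

In the distributed setting, the irregularity comes from the fact that the hash of a k-mer, not the locality of the read, determines which node's table partition receives the update.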
Evaluating the Potential of Disaggregated Memory Systems for HPC applications
Disaggregated memory is a promising approach that addresses the limitations
of traditional memory architectures by enabling memory to be decoupled from
compute nodes and shared across a data center. Cloud platforms have deployed
such systems to improve overall system memory utilization, but performance can
vary across workloads. High-performance computing (HPC) is crucial in
scientific and engineering applications, where HPC machines also face the issue
of underutilized memory. As a result, improving system memory utilization while
understanding workload performance is essential for HPC operators. Therefore,
learning the potential of a disaggregated memory system before deployment is a
critical step. This paper proposes a methodology for exploring the design space
of a disaggregated memory system. It incorporates key metrics that affect
performance on disaggregated memory systems: memory capacity, local and remote
memory access ratio, injection bandwidth, and bisection bandwidth, providing an
intuitive approach to guide machine configurations based on technology trends
and workload characteristics. We apply our methodology to analyze thirteen
diverse workloads, including AI training, data analysis, genomics, protein,
fusion, atomic nuclei, and traditional HPC bookends. Our methodology
demonstrates the ability to comprehend the potential and pitfalls of a
disaggregated memory system and provides motivation for machine configurations.
Our results show that eleven of our thirteen applications can leverage
injection bandwidth disaggregated memory without affecting performance, while
one pays a rack bisection bandwidth penalty and two pay the system-wide
bisection bandwidth penalty. In addition, we also show that intra-rack memory
disaggregation would meet the applications' memory requirements and provide
enough remote memory bandwidth.

Comment: This submission builds on the following conference paper: N. Ding, S. Williams, H.A. Nam, et al. Methodology for Evaluating the Potential of Disaggregated Memory Systems, 2nd International Workshop on RESource DISaggregation in High-Performance Computing (RESDIS), November 18, 2022. It has been submitted to the CCPE journal for review.
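A back-of-the-envelope version of the kind of reasoning such a methodology enables (this model is an illustrative assumption, not the paper's actual formulation) treats the local-to-remote access ratio and the two bandwidths as a harmonic mix, giving the slowdown a bandwidth-bound code would see:

```python
def remote_memory_slowdown(remote_fraction, local_bw, remote_bw):
    """Estimate slowdown for a bandwidth-bound code when a fraction of
    its memory traffic is served by remote (disaggregated) memory.
    Effective bandwidth is the harmonic mix of the two tiers."""
    effective = 1.0 / ((1.0 - remote_fraction) / local_bw
                       + remote_fraction / remote_bw)
    return local_bw / effective
```

For example, with 200 GB/s local and 50 GB/s remote bandwidth, placing half the traffic remotely already costs a 2.5x slowdown in this model, which is why the ratio of local to remote accesses is one of the methodology's key metrics.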
Enabling Massive Peptide Library Search Using GPU-FLASH (Cascadia Proteomics Symposium 2017)
Microbiome research has opened new frontiers in public health and environmental stewardship. However, protein sequence databases for these complex microbial communities are often incomplete or unavailable, which limits options for spectral annotation. Spectral library search is an efficient method for MS/MS identification, but library sizes can be prohibitively large for microbiome research. Standard techniques apply a precursor ion window to filter candidates for an exact match, which can easily overlook many possible homologous matches. Emerging open library search methods appear promising, but have yet to be tested at the scale necessary for microbial communities. This calls for an efficient approach that can perform open search across a spectral library within a reasonable time frame. As a solution we present a GPU-accelerated, highly efficient pairwise similarity algorithm which shortlists candidate spectra from a spectral library after performing an all-to-all comparison. Our preliminary results show that an open search of 35,000 spectra against a library of 1.18 million spectra takes approximately 45 minutes, which is similar to a database search for a small bacterial organism.
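The shortlisting step can be sketched with binned spectra and cosine similarity (a hedged stand-in: the actual GPU-FLASH scoring function is not specified in the abstract): score the query against every library spectrum and keep only the best candidates for full scoring.

```python
import math

def shortlist(query, library, top_n=2):
    """Rank library spectra against a query spectrum, both given as
    vectors over a shared m/z bin grid, and return the indices of the
    top_n candidates. This is the all-to-all step for one query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    scored = [(cos(query, s), i) for i, s in enumerate(library)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_n]]
```

Because every query is scored against every library spectrum, the work is embarrassingly parallel across pairs, which is what makes a GPU formulation attractive at library sizes in the millions.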
ADEPT: a domain independent sequence alignment strategy for gpu architectures.
Background: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, a dynamic-programming-based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator-based architectures, a need for an efficient GPU-accelerated strategy has emerged. Existing GPU-based strategies have been optimized either for a specific type of character (nucleotides or amino acids) or for only a handful of application use-cases.
Results: In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain-independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU-specific optimizations that do not rely on the nature of the sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT's driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large-scale computational systems. We have shown that the ADEPT-based Smith-Waterman algorithm demonstrates peak performance of 360 GCUPS and 497 GCUPS for protein-based and DNA-based datasets respectively on a single GPU node (8 GPUs) of the Cori supercomputer.
Overall, ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation.
Conclusions: ADEPT demonstrates performance that is comparable to or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bioinformatics software pipelines by integrating it into MetaHipMer, a high-performance de novo metagenome assembler, and PASTIS, a high-performance protein similarity graph construction pipeline. Our results show performance boosts of 10% and 30% in MetaHipMer and PASTIS respectively.
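The dynamic-programming recurrence that ADEPT accelerates is compact enough to state in full. A minimal CPU Smith-Waterman returning the best local-alignment score (scoring parameters here are illustrative defaults, not ADEPT's) looks like this:

```python
def smith_waterman(seq1, seq2, match=3, mismatch=-3, gap=-2):
    """Score-only local alignment. H[i][j] holds the best score of any
    local alignment ending at seq1[i-1], seq2[j-1]; clamping at zero is
    what makes the alignment local rather than global."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1]
                                      else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Note that only the substitution score depends on the alphabet; the recurrence itself is identical for DNA and protein sequences, which is the structural reason a domain-independent GPU strategy is possible.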